SeqKit is a cross-platform ultra-fast comprehensive toolkit for FASTA/Q processing. It works on Linux, macOS and Windows, and can be directly used without any dependencies or pre-configuration. Further information can be found on the project website and technical details are available in the research article.
In this tutorial we are going to learn how to use SeqKit to process and manipulate FASTA/Q files. The first thing to do is create a directory to store all the tutorial data. It is good practice to create a new directory for each project you work on, this ensures files do not get mixed up and all the results are self-contained.
Create a ‘tutorial’ directory to store output files:
mkdir tutorial
Download the tutorial and exercise data:
curl https://raw.githubusercontent.com/zifornd/bioinformatics-workshop/main/data/formats/data.tar.gz --output tutorial/data.tar.gz
## % Total % Received % Xferd Average Speed Time Time Time Current
## Dload Upload Total Spent Left Speed
##
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
0 13.2M 0 114k 0 0 104k 0 0:02:09 0:00:01 0:02:08 105k
2 13.2M 2 387k 0 0 184k 0 0:01:13 0:00:02 0:01:11 184k
3 13.2M 3 522k 0 0 168k 0 0:01:20 0:00:03 0:01:17 168k
5 13.2M 5 719k 0 0 174k 0 0:01:17 0:00:04 0:01:13 175k
6 13.2M 6 879k 0 0 172k 0 0:01:18 0:00:05 0:01:13 176k
8 13.2M 8 1135k 0 0 186k 0 0:01:12 0:00:06 0:01:06 204k
10 13.2M 10 1455k 0 0 205k 0 0:01:06 0:00:07 0:00:59 214k
13 13.2M 13 1791k 0 0 220k 0 0:01:01 0:00:08 0:00:53 253k
14 13.2M 14 1903k 0 0 208k 0 0:01:04 0:00:09 0:00:55 236k
14 13.2M 14 1983k 0 0 195k 0 0:01:09 0:00:10 0:00:59 219k
15 13.2M 15 2047k 0 0 184k 0 0:01:13 0:00:11 0:01:02 181k
16 13.2M 16 2175k 0 0 179k 0 0:01:15 0:00:12 0:01:03 144k
17 13.2M 17 2335k 0 0 178k 0 0:01:15 0:00:13 0:01:02 109k
19 13.2M 19 2639k 0 0 186k 0 0:01:12 0:00:14 0:00:58 145k
20 13.2M 20 2815k 0 0 186k 0 0:01:12 0:00:15 0:00:57 167k
21 13.2M 21 2975k 0 0 184k 0 0:01:13 0:00:16 0:00:57 185k
23 13.2M 23 3215k 0 0 188k 0 0:01:12 0:00:17 0:00:55 208k
27 13.2M 27 3743k 0 0 206k 0 0:01:05 0:00:18 0:00:47 281k
31 13.2M 31 4287k 0 0 224k 0 0:01:00 0:00:19 0:00:41 331k
32 13.2M 32 4447k 0 0 221k 0 0:01:01 0:00:20 0:00:41 325k
33 13.2M 33 4559k 0 0 216k 0 0:01:02 0:00:21 0:00:41 317k
34 13.2M 34 4655k 0 0 210k 0 0:01:04 0:00:22 0:00:42 287k
35 13.2M 35 4767k 0 0 206k 0 0:01:05 0:00:23 0:00:42 205k
37 13.2M 37 5023k 0 0 208k 0 0:01:04 0:00:24 0:00:40 148k
38 13.2M 38 5247k 0 0 209k 0 0:01:04 0:00:25 0:00:39 160k
40 13.2M 40 5439k 0 0 208k 0 0:01:05 0:00:26 0:00:39 174k
41 13.2M 41 5615k 0 0 207k 0 0:01:05 0:00:27 0:00:38 192k
42 13.2M 42 5807k 0 0 206k 0 0:01:05 0:00:28 0:00:37 207k
45 13.2M 45 6191k 0 0 212k 0 0:01:03 0:00:29 0:00:34 233k
48 13.2M 48 6575k 0 0 218k 0 0:01:02 0:00:30 0:00:32 265k
51 13.2M 51 6911k 0 0 222k 0 0:01:00 0:00:31 0:00:29 296k
51 13.2M 51 7039k 0 0 219k 0 0:01:01 0:00:32 0:00:29 284k
52 13.2M 52 7167k 0 0 216k 0 0:01:02 0:00:33 0:00:29 272k
55 13.2M 55 7471k 0 0 219k 0 0:01:01 0:00:34 0:00:27 256k
57 13.2M 57 7807k 0 0 222k 0 0:01:00 0:00:35 0:00:25 246k
59 13.2M 59 7999k 0 0 221k 0 0:01:01 0:00:36 0:00:25 217k
59 13.2M 59 8031k 0 0 216k 0 0:01:02 0:00:37 0:00:25 198k
59 13.2M 59 8095k 0 0 212k 0 0:01:03 0:00:38 0:00:25 184k
61 13.2M 61 8287k 0 0 211k 0 0:01:03 0:00:39 0:00:24 163k
62 13.2M 62 8431k 0 0 210k 0 0:01:04 0:00:40 0:00:24 124k
62 13.2M 62 8511k 0 0 207k 0 0:01:05 0:00:41 0:00:24 101k
63 13.2M 63 8607k 0 0 204k 0 0:01:06 0:00:42 0:00:24 115k
64 13.2M 64 8703k 0 0 201k 0 0:01:07 0:00:43 0:00:24 121k
65 13.2M 65 8863k 0 0 200k 0 0:01:07 0:00:44 0:00:23 115k
67 13.2M 67 9119k 0 0 202k 0 0:01:07 0:00:45 0:00:22 137k
68 13.2M 68 9279k 0 0 201k 0 0:01:07 0:00:46 0:00:21 154k
69 13.2M 69 9455k 0 0 200k 0 0:01:07 0:00:47 0:00:20 169k
71 13.2M 71 9631k 0 0 200k 0 0:01:07 0:00:48 0:00:19 186k
72 13.2M 72 9839k 0 0 200k 0 0:01:07 0:00:49 0:00:18 193k
73 13.2M 73 9.7M 0 0 199k 0 0:01:07 0:00:50 0:00:17 179k
76 13.2M 76 10.0M 0 0 201k 0 0:01:07 0:00:51 0:00:16 204k
78 13.2M 78 10.4M 0 0 205k 0 0:01:06 0:00:52 0:00:14 246k
81 13.2M 81 10.8M 0 0 208k 0 0:01:04 0:00:53 0:00:11 287k
84 13.2M 84 11.2M 0 0 212k 0 0:01:03 0:00:54 0:00:09 336k
90 13.2M 90 11.9M 0 0 221k 0 0:01:01 0:00:55 0:00:06 433k
94 13.2M 94 12.4M 0 0 228k 0 0:00:59 0:00:56 0:00:03 499k
100 13.2M 100 13.2M 0 0 237k 0 0:00:56 0:00:56 --:--:-- 590k
Extract the archive file into the tutorial directory:
tar xf tutorial/data.tar.gz --directory=tutorial
The software we are going to use in this tutorial can be installed using the conda package manager. Please refer to the previous conda workshop for details on installing software and creating conda environments.
Create a new environment with SeqKit installed:
conda create --yes --name seqkit seqkit
## Collecting package metadata (current_repodata.json): ...working... done
## Solving environment: ...working... done
##
## ## Package Plan ##
##
## environment location: /opt/miniconda3/envs/seqkit
##
## added / updated specs:
## - seqkit
##
##
## The following NEW packages will be INSTALLED:
##
## seqkit bioconda/osx-64::seqkit-2.3.1-h527b516_0
##
##
## Preparing transaction: ...working... done
## Verifying transaction: ...working... done
## Executing transaction: ...working... done
## #
## # To activate this environment, use
## #
## # $ conda activate seqkit
## #
## # To deactivate an active environment, use
## #
## # $ conda deactivate
##
## Retrieving notices: ...working... done
Activate the new environment to use it:
conda activate seqkit
Test that the seqkit command is available:
which seqkit
## /opt/miniconda3/envs/seqkit/bin/seqkit
We will start the tutorial by performing some basic operations on FASTA/Q files. The most popular operations include transforming sequences, calculating statistics, and extracting sub-sequences.
Display help information for seq subcommand:
seqkit seq --help
## transform sequences (extract ID, filter by length, remove gaps, reverse complement...)
##
## Usage:
## seqkit seq [flags]
##
## Flags:
## -k, --color colorize sequences - to be piped into "less -R"
## -p, --complement complement sequence, flag '-v' is recommended to switch on
## --dna2rna DNA to RNA
## -G, --gap-letters string gap letters (default "- \t.")
## -h, --help help for seq
## -l, --lower-case print sequences in lower case
## -M, --max-len int only print sequences shorter than the maximum length (-1 for no limit) (default -1)
## -R, --max-qual float only print sequences with average quality less than this limit (-1 for no limit) (default -1)
## -m, --min-len int only print sequences longer than the minimum length (-1 for no limit) (default -1)
## -Q, --min-qual float only print sequences with average quality qreater or equal than this limit (-1 for no limit) (default -1)
## -n, --name only print names
## -i, --only-id print ID instead of full head
## -q, --qual only print qualities
## -b, --qual-ascii-base int ASCII BASE, 33 for Phred+33 (default 33)
## -g, --remove-gaps remove gaps
## -r, --reverse reverse sequence
## --rna2dna RNA to DNA
## -s, --seq only print sequences
## -u, --upper-case print sequences in upper case
## -v, --validate-seq validate bases according to the alphabet
## -V, --validate-seq-length int length of sequence to validate (0 for whole seq) (default 10000)
##
## Global Flags:
## --alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
## --id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
## --id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
## --infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
## -w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
## -o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
## --quiet be quiet and do not show extra information
## -t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
## -j, --threads int number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)
Only print sequence names:
seqkit seq --name tutorial/data/example/genes.fasta
## GENE-1
## GENE-2
## GENE-3
## GENE-4
## GENE-5
## GENE-6
## GENE-7
## GENE-8
## GENE-9
## GENE-10
Only print sequences:
seqkit seq --seq tutorial/data/example/genes.fasta
## ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCTGTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACTCACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATCTTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTTCGTCCGGGTGTGACCGAAAGGTAAG
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTTTTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCAGAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTTTTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACCAGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGACAAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCCAATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGATAACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATAATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAGCAAGGTGAAATCAAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCACTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## ATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAATAGCGTACTTCTTTTTCTTGCTTTCGTGGTATTCTTGC
## ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGGAACCTAGTAATAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCCAACAGGAATAGGTTTTTGTATATAATTAAGTTAATTTTCCTCTGGCTGT
## ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGGACTTTTAAAGTTTCCAT
## ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTACCAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATACGAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAATTTGCACTGACTTGCTTTAGCACTCA
## ATGATTGAACTTTCATTAATTGACTTCTATTTGTGCTTTTTAGCCTTTCTGCTATTCCTTGTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTG
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGTAGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCACTTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC
Filter by sequence length:
# Only print sequences shorter than 250 characters
seqkit seq --max-len 250 tutorial/data/example/genes.fasta
## [33m[WARN][0m you may switch on flag -g/--remove-gaps to remove spaces
## >GENE-2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE-4
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAGCAAGGTGAAATC
## AAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCA
## CTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## >GENE-5
## ATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAATAGCGTACTTCTTTTT
## CTTGCTTTCGTGGTATTCTTGC
## >GENE-6
## ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGG
## AACCTAGTAATAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCC
## AACAGGAATAGGTTTTTGTATATAATTAAGTTAATTTTCCTCTGGCTGT
## >GENE-7
## ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGG
## ACTTTTAAAGTTTCCAT
## >GENE-8
## ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTAC
## CAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATAC
## GAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAATTTGCACTGACTTGCTTTAGC
## ACTCA
## >GENE-9
## ATGATTGAACTTTCATTAATTGACTTCTATTTGTGCTTTTTAGCCTTTCTGCTATTCCTT
## GTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTG
## >GENE-10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC
Display help information for stats subcommand:
seqkit stats --help
## simple statistics of FASTA/Q files
##
## Tips:
## 1. For lots of small files (especially on SDD), use big value of '-j' to
## parallelize counting.
##
## Usage:
## seqkit stats [flags]
##
## Aliases:
## stats, stat
##
## Flags:
## -a, --all all statistics, including quartiles of seq length, sum_gap, N50
## -b, --basename only output basename of files
## -E, --fq-encoding string fastq quality encoding. available values: 'sanger', 'solexa', 'illumina-1.3+', 'illumina-1.5+', 'illumina-1.8+'. (default "sanger")
## -G, --gap-letters string gap letters (default "- .")
## -h, --help help for stats
## -e, --skip-err skip error, only show warning message
## -i, --stdin-label string label for replacing default "-" for stdin (default "-")
## -T, --tabular output in machine-friendly tabular format
##
## Global Flags:
## --alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
## --id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
## --id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
## --infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
## -w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
## -o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
## --quiet be quiet and do not show extra information
## -t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
## -j, --threads int number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)
Calculate simple statistics:
seqkit stats tutorial/data/example/genes.fasta
## file format type num_seqs sum_len min_len avg_len max_len
## tutorial/data/example/genes.fasta FASTA DNA 10 1,771 77 177.1 335
Output simple statistics in tabular format:
seqkit stats --tabular tutorial/data/example/genes.fasta
## file format type num_seqs sum_len min_len avg_len max_len
## tutorial/data/example/genes.fasta FASTA DNA 10 1771 77 177.1 335
Calculate advanced statistics:
seqkit stats --all tutorial/data/example/genes.fasta
## file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%) GC(%)
## tutorial/data/example/genes.fasta FASTA DNA 10 1,771 77 177.1 335 102 166 232 0 185 0 0 39.3
Display help information for subseq subcommand:
seqkit subseq --help
## get subsequences by region/gtf/bed, including flanking sequences.
##
## Attentions:
## 1. Use "seqkit grep" for extract subsets of sequences.
## "seqtk subseq seqs.fasta id.txt" equals to
## "seqkit grep -f id.txt seqs.fasta"
##
## Recommendation:
## 1. use plain FASTA file, so seqkit could utilize FASTA index.
##
## The definition of region is 1-based and with some custom design.
##
## Examples:
##
## 1-based index 1 2 3 4 5 6 7 8 9 10
## negative index 0-9-8-7-6-5-4-3-2-1
## seq A C G T N a c g t n
## 1:1 A
## 2:4 C G T
## -4:-2 c g t
## -4:-1 c g t n
## -1:-1 n
## 2:-2 C G T N a c g t
## 1:-1 A C G T N a c g t n
## 1:12 A C G T N a c g t n
## -12:-1 A C G T N a c g t n
##
## Usage:
## seqkit subseq [flags]
##
## Flags:
## --bed string by tab-delimited BED file
## --chr strings select limited sequence with sequence IDs when using --gtf or --bed (multiple value supported, case ignored)
## -d, --down-stream int down stream length
## --feature strings select limited feature types (multiple value supported, case ignored, only works with GTF)
## --gtf string by GTF (version 2.2) file
## --gtf-tag string output this tag as sequence comment (default "gene_id")
## -h, --help help for subseq
## -f, --only-flank only return up/down stream sequence
## -r, --region string by region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases, 13:-1 for cutting first 12 bases. type "seqkit subseq -h" for more examples
## -u, --up-stream int up stream length
##
## Global Flags:
## --alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
## --id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
## --id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
## --infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
## -w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
## -o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
## --quiet be quiet and do not show extra information
## -t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
## -j, --threads int number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)
Get the first 10 bases from each sequence:
seqkit subseq --region 1:10 tutorial/data/example/genes.fasta
## [INFO][0m create FASTA index for tutorial/data/example/genes.fasta
## >GENE-1
## ATTAAAGGTT
## >GENE-2
## ATGGAGAGCC
## >GENE-3
## ATGTTTGTTT
## >GENE-4
## ATGGATTTGT
## >GENE-5
## ATGTACTCAT
## >GENE-6
## ATGGCAGATT
## >GENE-7
## ATGTTTCATC
## >GENE-8
## ATGAAAATTA
## >GENE-9
## ATGATTGAAC
## >GENE-10
## ATGAAATTTC
Get the last 10 bases from each sequence:
seqkit subseq --region -10:-1 tutorial/data/example/genes.fasta
## >GENE-1
## GAAAGGTAAG
## >GENE-2
## GATGCTCGAA
## >GENE-3
## CTTTAGATTC
## >GENE-4
## CTTCTTGCTG
## >GENE-5
## GTATTCTTGC
## >GENE-6
## CTCTGGCTGT
## >GENE-7
## AAGTTTCCAT
## >GENE-8
## TTAGCACTCA
## >GENE-9
## ACTTGAACTG
## >GENE-10
## CTAGAAAATC
Next, we will look at how to convert between the FASTA/Q file formats. Remember, it only makes sense to convert from FASTQ to FASTA and not the other way around.
Display help information for fq2fa subcommand:
seqkit fq2fa --help
## convert FASTQ to FASTA
##
## Usage:
## seqkit fq2fa [flags]
##
## Flags:
## -h, --help help for fq2fa
##
## Global Flags:
## --alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
## --id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
## --id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
## --infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
## -w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
## -o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
## --quiet be quiet and do not show extra information
## -t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
## -j, --threads int number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)
Convert FASTQ to FASTA:
seqkit fq2fa tutorial/data/example/reads.fastq
## >READ-1
## CCTAAGAATACTGGAGAAATTTAGCTGGAAAGATAC
## >READ-2
## ACCATCCTGTGGTGAGGTAGCGGGTTCTCTGCGTCTCT
## >READ-3
## AGATTCTGCAAAGGAATGTCCAGCATCTGAAAATAA
## >READ-4
## CTGCTTCAAGGATGAATTCAGGGACCTTCAGACCCTGT
## >READ-5
## GTGGGTGGATTGATGGAAGTCCCTTGGCTGGCTGGCTG
## >READ-6
## ACATCTTTACGAGTTATATCCTTCTGCAAACCAACAT
## >READ-7
## ATCATTTTTGTAGAAAATCTTCTACCACATGTTAGAC
## >READ-8
## ACAGGTGGCATTTAGCATCTTACTATCCTGAAAGAACT
## >READ-9
## TGCTGGGTGGTCCTGAAGGAACCCCCCAGCTCTCTGCT
## >READ-10
## AAAAAGTGGAAAATTTAGAAATGTCCACTGTAGGACTT
Display help information for fx2tab subcommand:
seqkit fx2tab --help
## convert FASTA/Q to tabular format, and provide various information,
## like sequence length, GC content/GC skew.
##
## Attention:
## 1. Fixed three columns (ID, sequence, quality) are outputted for either FASTA
## or FASTQ, except when flag -n/--name is on. This is for format compatibility.
##
## Usage:
## seqkit fx2tab [flags]
##
## Flags:
## -a, --alphabet print alphabet letters
## -q, --avg-qual print average quality of a read
## -B, --base-content strings print base content. (case ignored, multiple values supported) e.g. -B AT -B N
## -C, --base-count strings print base count. (case ignored, multiple values supported) e.g. -C AT -C N
## -I, --case-sensitive calculate case sensitive base content/sequence hash
## -g, --gc print GC content
## -G, --gc-skew print GC-Skew
## -H, --header-line print header line
## -h, --help help for fx2tab
## -l, --length print sequence length
## -n, --name only print names (no sequences and qualities)
## -Q, --no-qual only output two column even for FASTQ file
## -i, --only-id print ID instead of full head
## -b, --qual-ascii-base int ASCII BASE, 33 for Phred+33 (default 33)
## -s, --seq-hash print hash (MD5) of sequence
##
## Global Flags:
## --alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
## --id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
## --id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
## --infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
## -w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
## -o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
## --quiet be quiet and do not show extra information
## -t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
## -j, --threads int number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)
Output FASTQ in tabular format:
seqkit fx2tab tutorial/data/example/reads.fastq
## READ-1 CCTAAGAATACTGGAGAAATTTAGCTGGAAAGATAC AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
## READ-2 ACCATCCTGTGGTGAGGTAGCGGGTTCTCTGCGTCTCT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
## READ-3 AGATTCTGCAAAGGAATGTCCAGCATCTGAAAATAA AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
## READ-4 CTGCTTCAAGGATGAATTCAGGGACCTTCAGACCCTGT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEEE
## READ-5 GTGGGTGGATTGATGGAAGTCCCTTGGCTGGCTGGCTG AAAAAEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEE
## READ-6 ACATCTTTACGAGTTATATCCTTCTGCAAACCAACAT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE
## READ-7 ATCATTTTTGTAGAAAATCTTCTACCACATGTTAGAC AAAAAEEAEEEEEEEE6EEEEEEEEEEEEEEEE/EEE
## READ-8 ACAGGTGGCATTTAGCATCTTACTATCCTGAAAGAACT AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
## READ-9 TGCTGGGTGGTCCTGAAGGAACCCCCCAGCTCTCTGCT AAAAAEEEEEEAEEE6EEEA6EEEEEE/E/EEAEAEAA
## READ-10 AAAAAGTGGAAAATTTAGAAATGTCCACTGTAGGACTT AAAAAEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEE
Output specific FASTQ information in tabular format:
# print sequence name, length, and GC content
seqkit fx2tab --name --length --gc tutorial/data/example/reads.fastq
## READ-1 36 36.11
## READ-2 38 57.89
## READ-3 36 36.11
## READ-4 38 50.00
## READ-5 38 60.53
## READ-6 37 35.14
## READ-7 37 29.73
## READ-8 38 39.47
## READ-9 38 63.16
## READ-10 38 31.58
Now we will look at how to search for small sub-sequences within the sequences stored in FASTA/Q files.
Display help information for grep subcommand:
seqkit grep --help
## search sequences by ID/name/sequence/sequence motifs, mismatch allowed
##
## Attentions:
##
## 0. By default, we match sequence ID with patterns, use "-n/--by-name"
## for matching full name instead of just ID.
## 1. Unlike POSIX/GNU grep, we compare the pattern to the whole target
## (ID/full header) by default. Please switch "-r/--use-regexp" on
## for partly matching.
## 2. When searching by sequences, it's partly matching, and both positive
## and negative strands are searched.
## Mismatch is allowed using flag "-m/--max-mismatch", you can increase
## the value of "-j/--threads" to accelerate processing.
## 3. Degenerate bases/residues like "RYMM.." are also supported by flag -d.
## But do not use degenerate bases/residues in regular expression, you need
## convert them to regular expression, e.g., change "N" or "X" to ".".
## 4. When providing search patterns (motifs) via flag '-p',
## please use double quotation marks for patterns containing comma,
## e.g., -p '"A{2,}"' or -p "\"A{2,}\"". Because the command line argument
## parser accepts comma-separated-values (CSV) for multiple values (motifs).
## Patterns in file do not follow this rule.
## 5. The order of sequences in result is consistent with that in original
## file, not the order of the query patterns.
## But for FASTA file, you can use:
## seqkit faidx seqs.fasta --infile-list IDs.txt
## 6. For multiple patterns, you can either set "-p" multiple times, i.e.,
## -p pattern1 -p pattern2, or give a file of patterns via "-f/--pattern-file".
##
## You can specify the sequence region for searching with the flag -R (--region).
## The definition of region is 1-based and with some custom design.
##
## Examples:
##
## 1-based index 1 2 3 4 5 6 7 8 9 10
## negative index 0-9-8-7-6-5-4-3-2-1
## seq A C G T N a c g t n
## 1:1 A
## 2:4 C G T
## -4:-2 c g t
## -4:-1 c g t n
## -1:-1 n
## 2:-2 C G T N a c g t
## 1:-1 A C G T N a c g t n
## 1:12 A C G T N a c g t n
## -12:-1 A C G T N a c g t n
##
## Usage:
## seqkit grep [flags]
##
## Flags:
## -n, --by-name match by full name instead of just ID
## -s, --by-seq search subseq on seq, both positive and negative strand are searched, and mismatch allowed using flag -m/--max-mismatch
## -c, --circular circular genome
## -C, --count just print a count of matching records. with the -v/--invert-match flag, count non-matching records
## -d, --degenerate pattern/motif contains degenerate base
## --delete-matched delete a pattern right after being matched, this keeps the firstly matched data and speedups when using regular expressions
## -h, --help help for grep
## -i, --ignore-case ignore case
## -I, --immediate-output print output immediately, do not use write buffer
## -v, --invert-match invert the sense of matching, to select non-matching records
## -m, --max-mismatch int max mismatch when matching by seq. For large genomes like human genome, using mapping/alignment tools would be faster
## -P, --only-positive-strand only search on positive strand
## -p, --pattern strings search pattern (multiple values supported. Attention: use double quotation marks for patterns containing comma, e.g., -p '"A{2,}"'))
## -f, --pattern-file string pattern file (one record per line)
## -R, --region string specify sequence region for searching. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases
## -r, --use-regexp patterns are regular expression
##
## Global Flags:
## --alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
## --id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
## --id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
## --infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
## -w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
## -o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
## --quiet be quiet and do not show extra information
## -t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
## -j, --threads int number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)
Extract sequences with matching name:
seqkit grep --by-name --pattern "gene-GU280_gp04" tutorial/data/example/genes.fasta
Extract sequences containing AGGCG subsequence:
seqkit grep --by-seq --pattern "AGGCG" tutorial/data/example/genes.fasta
## >GENE-2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
Extract sequences containing AGGCG in the first N bases:
# sequence region for searching (1:10 for first 10 bases)
seqkit grep --by-seq --pattern "AGGCG" --region 1:10 tutorial/data/example/genes.fasta
Display help information for locate subcommand:
seqkit locate --help
## locate subsequences/motifs, mismatch allowed
##
## Attentions:
##
## 1. Motifs could be EITHER plain sequence containing "ACTGN" OR regular
## expression like "A[TU]G(?:.{3})+?[TU](?:AG|AA|GA)" for ORFs.
## 2. Degenerate bases/residues like "RYMM.." are also supported by flag -d.
## But do not use degenerate bases/residues in regular expression, you need
## convert them to regular expression, e.g., change "N" or "X" to ".".
## 3. When providing search patterns (motifs) via flag '-p',
## please use double quotation marks for patterns containing comma,
## e.g., -p '"A{2,}"' or -p "\"A{2,}\"". Because the command line argument
## parser accepts comma-separated-values (CSV) for multiple values (motifs).
## Patterns in file do not follow this rule.
## 4. Mismatch is allowed using flag "-m/--max-mismatch",
## you can increase the value of "-j/--threads" to accelerate processing.
## 5. When using flag --circular, end position of matched subsequence that
## crossing genome sequence end would be greater than sequence length.
##
## Usage:
## seqkit locate [flags]
##
## Flags:
## --bed output in BED6 format
## -c, --circular circular genome. type "seqkit locate -h" for details
## -d, --degenerate pattern/motif contains degenerate base
## --gtf output in GTF format
## -h, --help help for locate
## -M, --hide-matched do not show matched sequences
## -i, --ignore-case ignore case
## -I, --immediate-output print output immediately, do not use write buffer
## -m, --max-mismatch int max mismatch when matching by seq. For large genomes like human genome, using mapping/alignment tools would be faster
## -G, --non-greedy non-greedy mode, faster but may miss motifs overlapping with others
## -P, --only-positive-strand only search on positive strand
## -p, --pattern strings pattern/motif (multiple values supported. Attention: use double quotation marks for patterns containing comma, e.g., -p '"A{2,}"')
## -f, --pattern-file string pattern/motif file (FASTA format)
## -F, --use-fmi use FM-index for much faster search of lots of sequence patterns
## -r, --use-regexp patterns/motifs are regular expression
## -V, --validate-seq-length int length of sequence to validate (0 for whole seq) (default 10000)
##
## Global Flags:
## --alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
## --id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
## --id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
## --infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
## -w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
## -o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
## --quiet be quiet and do not show extra information
## -t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
## -j, --threads int number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)
Locate a subsequence in each sequence:
# exact match
seqkit locate --pattern "ACGT" tutorial/data/example/genes.fasta
## seqID patternName pattern strand start end matched
## GENE-2 ACGT ACGT + 38 41 ACGT
## GENE-2 ACGT ACGT + 74 77 ACGT
## GENE-2 ACGT ACGT + 84 87 ACGT
## GENE-2 ACGT ACGT + 126 129 ACGT
## GENE-2 ACGT ACGT + 216 219 ACGT
## GENE-2 ACGT ACGT - 216 219 ACGT
## GENE-2 ACGT ACGT - 126 129 ACGT
## GENE-2 ACGT ACGT - 84 87 ACGT
## GENE-2 ACGT ACGT - 74 77 ACGT
## GENE-2 ACGT ACGT - 38 41 ACGT
## GENE-3 ACGT ACGT + 99 102 ACGT
## GENE-3 ACGT ACGT - 99 102 ACGT
## GENE-5 ACGT ACGT + 31 34 ACGT
## GENE-5 ACGT ACGT - 31 34 ACGT
Locate a subsequence with N mismatches:
# allow 1 mismatch
seqkit locate --pattern "ACGT" --max-mismatch 1 tutorial/data/example/genes.fasta
## seqID patternName pattern strand start end matched
## GENE-1 ACGT ACGT + 6 9 AGGT
## GENE-1 ACGT ACGT + 14 17 ACCT
## GENE-1 ACGT ACGT + 22 25 AGGT
## GENE-1 ACGT ACGT + 39 42 ACTT
## GENE-1 ACGT ACGT + 70 73 ACGA
## GENE-1 ACGT ACGT + 74 77 ACTT
## GENE-1 ACGT ACGT + 122 125 ACGC
## GENE-1 ACGT ACGT + 150 153 TCGT
## GENE-1 ACGT ACGT + 163 166 ACGA
## GENE-1 ACGT ACGT + 172 175 TCGT
## GENE-1 ACGT ACGT + 196 199 ACGG
## GENE-1 ACGT ACGT + 202 205 TCGT
## GENE-1 ACGT ACGT + 206 209 CCGT
## GENE-1 ACGT ACGT + 229 232 ACAT
## GENE-1 ACGT ACGT + 235 238 AGGT
## GENE-1 ACGT ACGT + 240 243 TCGT
## GENE-1 ACGT ACGT + 259 262 AGGT
## GENE-1 ACGT ACGT - 259 262 ACCT
## GENE-1 ACGT ACGT - 240 243 ACGA
## GENE-1 ACGT ACGT - 235 238 ACCT
## GENE-1 ACGT ACGT - 229 232 ATGT
## GENE-1 ACGT ACGT - 206 209 ACGG
## GENE-1 ACGT ACGT - 202 205 ACGA
## GENE-1 ACGT ACGT - 196 199 CCGT
## GENE-1 ACGT ACGT - 172 175 ACGA
## GENE-1 ACGT ACGT - 163 166 TCGT
## GENE-1 ACGT ACGT - 150 153 ACGA
## GENE-1 ACGT ACGT - 122 125 GCGT
## GENE-1 ACGT ACGT - 74 77 AAGT
## GENE-1 ACGT ACGT - 70 73 TCGT
## GENE-1 ACGT ACGT - 39 42 AAGT
## GENE-1 ACGT ACGT - 22 25 ACCT
## GENE-1 ACGT ACGT - 14 17 AGGT
## GENE-1 ACGT ACGT - 6 9 ACCT
## GENE-2 ACGT ACGT + 26 29 ACGA
## GENE-2 ACGT ACGT + 38 41 ACGT
## GENE-2 ACGT ACGT + 65 68 AGGT
## GENE-2 ACGT ACGT + 74 77 ACGT
## GENE-2 ACGT ACGT + 80 83 TCGT
## GENE-2 ACGT ACGT + 84 87 ACGT
## GENE-2 ACGT ACGT + 101 104 CCGT
## GENE-2 ACGT ACGT + 110 113 AGGT
## GENE-2 ACGT ACGT + 126 129 ACGT
## GENE-2 ACGT ACGT + 132 135 ACAT
## GENE-2 ACGT ACGT + 148 151 ACTT
## GENE-2 ACGT ACGT + 164 167 AAGT
## GENE-2 ACGT ACGT + 176 179 GCGT
## GENE-2 ACGT ACGT + 189 192 ACTT
## GENE-2 ACGT ACGT + 203 206 ATGT
## GENE-2 ACGT ACGT + 216 219 ACGT
## GENE-2 ACGT ACGT - 216 219 ACGT
## GENE-2 ACGT ACGT - 203 206 ACAT
## GENE-2 ACGT ACGT - 189 192 AAGT
## GENE-2 ACGT ACGT - 176 179 ACGC
## GENE-2 ACGT ACGT - 164 167 ACTT
## GENE-2 ACGT ACGT - 148 151 AAGT
## GENE-2 ACGT ACGT - 132 135 ATGT
## GENE-2 ACGT ACGT - 126 129 ACGT
## GENE-2 ACGT ACGT - 110 113 ACCT
## GENE-2 ACGT ACGT - 101 104 ACGG
## GENE-2 ACGT ACGT - 84 87 ACGT
## GENE-2 ACGT ACGT - 80 83 ACGA
## GENE-2 ACGT ACGT - 74 77 ACGT
## GENE-2 ACGT ACGT - 65 68 ACCT
## GENE-2 ACGT ACGT - 38 41 ACGT
## GENE-2 ACGT ACGT - 26 29 TCGT
## GENE-3 ACGT ACGT + 1 4 ATGT
## GENE-3 ACGT ACGT + 99 102 ACGT
## GENE-3 ACGT ACGT + 122 125 AAGT
## GENE-3 ACGT ACGT + 144 147 ACAT
## GENE-3 ACGT ACGT + 158 161 ACTT
## GENE-3 ACGT ACGT + 168 171 ACCT
## GENE-3 ACGT ACGT + 182 185 ATGT
## GENE-3 ACGT ACGT + 187 190 ACTT
## GENE-3 ACGT ACGT + 204 207 ACAT
## GENE-3 ACGT ACGT + 206 209 ATGT
## GENE-3 ACGT ACGT + 232 235 AGGT
## GENE-3 ACGT ACGT + 289 292 AAGT
## GENE-3 ACGT ACGT + 296 299 ACAT
## GENE-3 ACGT ACGT + 325 328 ACTT
## GENE-3 ACGT ACGT - 325 328 AAGT
## GENE-3 ACGT ACGT - 296 299 ATGT
## GENE-3 ACGT ACGT - 289 292 ACTT
## GENE-3 ACGT ACGT - 232 235 ACCT
## GENE-3 ACGT ACGT - 206 209 ACAT
## GENE-3 ACGT ACGT - 204 207 ATGT
## GENE-3 ACGT ACGT - 187 190 AAGT
## GENE-3 ACGT ACGT - 182 185 ACAT
## GENE-3 ACGT ACGT - 168 171 AGGT
## GENE-3 ACGT ACGT - 158 161 AAGT
## GENE-3 ACGT ACGT - 144 147 ATGT
## GENE-3 ACGT ACGT - 122 125 ACTT
## GENE-3 ACGT ACGT - 99 102 ACGT
## GENE-3 ACGT ACGT - 1 4 ACAT
## GENE-4 ACGT ACGT + 40 43 ACTT
## GENE-4 ACGT ACGT + 51 54 AGGT
## GENE-4 ACGT ACGT + 100 103 ACGA
## GENE-4 ACGT ACGT + 146 149 GCGT
## GENE-4 ACGT ACGT + 153 156 ACTT
## GENE-4 ACGT ACGT - 153 156 AAGT
## GENE-4 ACGT ACGT - 146 149 ACGC
## GENE-4 ACGT ACGT - 100 103 TCGT
## GENE-4 ACGT ACGT - 51 54 ACCT
## GENE-4 ACGT ACGT - 40 43 AAGT
## GENE-5 ACGT ACGT + 1 4 ATGT
## GENE-5 ACGT ACGT + 11 14 TCGT
## GENE-5 ACGT ACGT + 27 30 AGGT
## GENE-5 ACGT ACGT + 31 34 ACGT
## GENE-5 ACGT ACGT + 47 50 GCGT
## GENE-5 ACGT ACGT + 51 54 ACTT
## GENE-5 ACGT ACGT + 68 71 TCGT
## GENE-5 ACGT ACGT - 68 71 ACGA
## GENE-5 ACGT ACGT - 51 54 AAGT
## GENE-5 ACGT ACGT - 47 50 ACGC
## GENE-5 ACGT ACGT - 31 34 ACGT
## GENE-5 ACGT ACGT - 27 30 ACCT
## GENE-5 ACGT ACGT - 11 14 ACGA
## GENE-5 ACGT ACGT - 1 4 ACAT
## GENE-6 ACGT ACGT + 14 17 ACGG
## GENE-6 ACGT ACGT + 26 29 CCGT
## GENE-6 ACGT ACGT + 62 65 ACCT
## GENE-6 ACGT ACGT + 72 75 AGGT
## GENE-6 ACGT ACGT + 88 91 ACAT
## GENE-6 ACGT ACGT + 130 133 AGGT
## GENE-6 ACGT ACGT + 148 151 AAGT
## GENE-6 ACGT ACGT - 148 151 ACTT
## GENE-6 ACGT ACGT - 130 133 ACCT
## GENE-6 ACGT ACGT - 88 91 ATGT
## GENE-6 ACGT ACGT - 72 75 ACCT
## GENE-6 ACGT ACGT - 62 65 AGGT
## GENE-6 ACGT ACGT - 26 29 ACGG
## GENE-6 ACGT ACGT - 14 17 CCGT
## GENE-7 ACGT ACGT + 1 4 ATGT
## GENE-7 ACGT ACGT + 11 14 TCGT
## GENE-7 ACGT ACGT + 17 20 ACTT
## GENE-7 ACGT ACGT + 23 26 AGGT
## GENE-7 ACGT ACGT + 61 64 ACTT
## GENE-7 ACGT ACGT + 68 71 AAGT
## GENE-7 ACGT ACGT - 68 71 ACTT
## GENE-7 ACGT ACGT - 61 64 AAGT
## GENE-7 ACGT ACGT - 23 26 ACCT
## GENE-7 ACGT ACGT - 17 20 AAGT
## GENE-7 ACGT ACGT - 11 14 ACGA
## GENE-7 ACGT ACGT - 1 4 ACAT
## GENE-8 ACGT ACGT + 40 43 ACTT
## GENE-8 ACGT ACGT + 75 78 AGGT
## GENE-8 ACGT ACGT + 87 90 ACTT
## GENE-8 ACGT ACGT + 99 102 ACCT
## GENE-8 ACGT ACGT + 115 118 ACAT
## GENE-8 ACGT ACGT + 119 122 ACGA
## GENE-8 ACGT ACGT + 169 172 ACTT
## GENE-8 ACGT ACGT - 169 172 AAGT
## GENE-8 ACGT ACGT - 119 122 TCGT
## GENE-8 ACGT ACGT - 115 118 ATGT
## GENE-8 ACGT ACGT - 99 102 AGGT
## GENE-8 ACGT ACGT - 87 90 AAGT
## GENE-8 ACGT ACGT - 75 78 ACCT
## GENE-8 ACGT ACGT - 40 43 AAGT
## GENE-9 ACGT ACGT + 9 12 ACTT
## GENE-9 ACGT ACGT + 23 26 ACTT
## GENE-9 ACGT ACGT + 93 96 ACTT
## GENE-9 ACGT ACGT - 93 96 AAGT
## GENE-9 ACGT ACGT - 23 26 AAGT
## GENE-9 ACGT ACGT - 9 12 AAGT
## GENE-10 ACGT ACGT + 57 60 ATGT
## GENE-10 ACGT ACGT + 72 75 ATGT
## GENE-10 ACGT ACGT + 81 84 ACAT
## GENE-10 ACGT ACGT + 92 95 ATGT
## GENE-10 ACGT ACGT + 106 109 CCGT
## GENE-10 ACGT ACGT + 119 122 ACTT
## GENE-10 ACGT ACGT - 119 122 AAGT
## GENE-10 ACGT ACGT - 106 109 ACGG
## GENE-10 ACGT ACGT - 92 95 ACAT
## GENE-10 ACGT ACGT - 81 84 ATGT
## GENE-10 ACGT ACGT - 72 75 ACAT
## GENE-10 ACGT ACGT - 57 60 ACAT
Locate a subsequence using regular expressions:
# match AAGC or GGGC
seqkit locate --pattern "(A{2}|G{2})GC" --use-regexp tutorial/data/example/genes.fasta
## seqID patternName pattern strand start end matched
## GENE-1 (A{2}|G{2})GC (A{2}|G{2})GC - 192 195 AAGC
## GENE-1 (A{2}|G{2})GC (A{2}|G{2})GC - 109 112 AAGC
## GENE-2 (A{2}|G{2})GC (A{2}|G{2})GC - 198 201 GGGC
## GENE-2 (A{2}|G{2})GC (A{2}|G{2})GC - 155 158 AAGC
## GENE-2 (A{2}|G{2})GC (A{2}|G{2})GC - 89 92 AAGC
## GENE-3 (A{2}|G{2})GC (A{2}|G{2})GC - 277 280 AAGC
## GENE-4 (A{2}|G{2})GC (A{2}|G{2})GC + 46 49 AAGC
## GENE-4 (A{2}|G{2})GC (A{2}|G{2})GC + 113 116 AAGC
## GENE-4 (A{2}|G{2})GC (A{2}|G{2})GC - 135 138 AAGC
## GENE-5 (A{2}|G{2})GC (A{2}|G{2})GC - 64 67 AAGC
## GENE-6 (A{2}|G{2})GC (A{2}|G{2})GC + 43 46 AAGC
## GENE-6 (A{2}|G{2})GC (A{2}|G{2})GC - 36 39 AAGC
## GENE-8 (A{2}|G{2})GC (A{2}|G{2})GC + 123 126 GGGC
## GENE-8 (A{2}|G{2})GC (A{2}|G{2})GC - 173 176 AAGC
## GENE-8 (A{2}|G{2})GC (A{2}|G{2})GC - 48 51 AAGC
## GENE-9 (A{2}|G{2})GC (A{2}|G{2})GC - 72 75 AAGC
## GENE-9 (A{2}|G{2})GC (A{2}|G{2})GC - 35 38 AAGC
Display help information for sample subcommand:
seqkit sample --help
## sample sequences by number or proportion.
##
## Attention:
## 1. Do not use '-n' on large FASTQ files, it loads all seqs into memory!
## use 'seqkit sample -p 0.1 seqs.fq.gz | seqkit head -n N' instead!
##
## Usage:
## seqkit sample [flags]
##
## Flags:
## -h, --help help for sample
## -n, --number int sample by number (result may not exactly match), DO NOT use on large FASTQ files.
## -p, --proportion float sample by proportion
## -s, --rand-seed int rand seed (default 11)
## -2, --two-pass 2-pass mode read files twice to lower memory usage. Not allowed when reading from stdin
##
## Global Flags:
## --alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
## --id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
## --id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
## --infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
## -w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
## -o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
## --quiet be quiet and do not show extra information
## -t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
## -j, --threads int number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)
Sample sequences by number:
# set seed to reproduce result
seqkit sample --number 5 --rand-seed 1701 tutorial/data/example/genes.fasta
## [INFO][0m sample by number
## [INFO][0m loading all sequences into memory...
## [INFO][0m 4 sequences outputted
## >GENE-2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE-3
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
## AGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGAC
## AAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCC
## AATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGAT
## AACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATA
## ATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC
## >GENE-4
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAGCAAGGTGAAATC
## AAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCA
## CTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## >GENE-10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC
Sample sequences by proportion:
# set seed to reproduce result
seqkit sample --proportion 0.25 --rand-seed 1701 tutorial/data/example/genes.fasta
## [INFO][0m sample by proportion
## [INFO][0m 3 sequences outputted
## >GENE-2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE-3
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
## AGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGAC
## AAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCC
## AATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGAT
## AACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATA
## ATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC
## >GENE-10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC
Display help information for split subcommand:
seqkit split --help
## split sequences into files by name ID, subsequence of given region,
## part size or number of parts.
##
## If you just want to split by parts or sizes, please use "seqkit split2",
## which also applies for paired- and single-end FASTQ.
##
## The definition of region is 1-based and with some custom design.
##
## Examples:
##
## 1-based index 1 2 3 4 5 6 7 8 9 10
## negative index 0-9-8-7-6-5-4-3-2-1
## seq A C G T N a c g t n
## 1:1 A
## 2:4 C G T
## -4:-2 c g t
## -4:-1 c g t n
## -1:-1 n
## 2:-2 C G T N a c g t
## 1:-1 A C G T N a c g t n
## 1:12 A C G T N a c g t n
## -12:-1 A C G T N a c g t n
##
## Usage:
## seqkit split [flags]
##
## Flags:
## -i, --by-id split squences according to sequence ID
## --by-id-prefix string file prefix for --by-id
## -p, --by-part int split sequences into N parts
## --by-part-prefix string file prefix for --by-part
## -r, --by-region string split squences according to subsequence of given region. e.g 1:12 for first 12 bases, -12:-1 for last 12 bases. type "seqkit split -h" for more examples
## --by-region-prefix string file prefix for --by-region
## -s, --by-size int split sequences into multi parts with N sequences
## --by-size-prefix string file prefix for --by-size
## -d, --dry-run dry run, just print message and no files will be created.
## -e, --extension string set output file extension, e.g., ".gz", ".xz", or ".zst"
## -f, --force overwrite output directory
## -h, --help help for split
## -k, --keep-temp keep temporary FASTA and .fai file when using 2-pass mode
## -O, --out-dir string output directory (default value is $infile.split)
## -2, --two-pass two-pass mode read files twice to lower memory usage. (only for FASTA format)
##
## Global Flags:
## --alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
## --id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
## --id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
## --infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
## -w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
## -o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
## --quiet be quiet and do not show extra information
## -t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
## -j, --threads int number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)
Split sequences into files by name:
seqkit split --by-id tutorial/data/example/genes.fasta
## [INFO][0m split by ID. idRegexp: ^(\S+)\s?
## [INFO][0m read sequences ...
## [INFO][0m read 10 sequences
## [INFO][0m write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-3.fasta
## [INFO][0m write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-4.fasta
## [INFO][0m write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-6.fasta
## [INFO][0m write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-7.fasta
## [INFO][0m write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-8.fasta
## [INFO][0m write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-1.fasta
## [INFO][0m write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-5.fasta
## [INFO][0m write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-9.fasta
## [INFO][0m write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-10.fasta
## [INFO][0m write 1 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_GENE-2.fasta
Split sequences into multiple files with N sequences:
# 2 sequences per file
seqkit split --by-size 2 tutorial/data/example/genes.fasta
## [33m[WARN][0m outdir not empty: tutorial/data/example/genes.fasta.split, you can use --force to overwrite
## [INFO][0m split into 2 seqs per file
## [INFO][0m write 2 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_001.fasta
## [INFO][0m write 2 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_002.fasta
## [INFO][0m write 2 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_003.fasta
## [INFO][0m write 2 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_004.fasta
## [INFO][0m write 2 sequences to file: tutorial/data/example/genes.fasta.split/genes.part_005.fasta
Display help information for replace subcommand:
seqkit replace --help
## replace name/sequence by regular expression.
##
## Note that the replacement supports capture variables.
## e.g. $1 represents the text of the first submatch.
## ATTENTION: use SINGLE quote NOT double quotes in *nix OS.
##
## Examples: Adding space to all bases.
##
## seqkit replace -p "(.)" -r '$1 ' -s
##
## Or use the \ escape character.
##
## seqkit replace -p "(.)" -r "\$1 " -s
##
## more on: http://bioinf.shenwei.me/seqkit/usage/#replace
##
## Special replacement symbols (only for replacing name not sequence):
##
## {nr} Record number, starting from 1
## {kv} Corresponding value of the key (captured variable $n) by key-value file,
## n can be specified by flag -I (--key-capt-idx) (default: 1)
##
## Special cases:
## 1. If replacements contain '$',
## a). If using '{kv}', you need use '$$$$' instead of a single '$':
## -r '{kv}' -k <(sed 's/\$/$$$$/' kv.txt)
## b). If not, use '$$':
## -r 'xxx$$xx'
##
## Usage:
## seqkit replace [flags]
##
## Flags:
## -s, --by-seq replace seq (only FASTA)
## -h, --help help for replace
## -i, --ignore-case ignore case
## -K, --keep-key keep the key as value when no value found for the key (only for sequence name)
## -U, --keep-untouch do not change anything when no value found for the key (only for sequence name)
## -I, --key-capt-idx int capture variable index of key (1-based) (default 1)
## -m, --key-miss-repl string replacement for key with no corresponding value
## -k, --kv-file string tab-delimited key-value file for replacing key with value when using "{kv}" in -r (--replacement) (only for sequence name)
## --nr-width int minimum width for {nr} in flag -r/--replacement. e.g., formatting "1" to "001" by --nr-width 3 (default 1)
## -p, --pattern string search regular expression
## -r, --replacement string replacement. supporting capture variables. e.g. $1 represents the text of the first submatch. ATTENTION: for *nix OS, use SINGLE quote NOT double quotes or use the \ escape character. Record number is also supported by "{nr}".use ${1} instead of $1 when {kv} given!
##
## Global Flags:
## --alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
## --id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
## --id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
## --infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
## -w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
## -o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
## --quiet be quiet and do not show extra information
## -t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
## -j, --threads int number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)
Replace name by regular expression:
seqkit replace --pattern "-" --replacement "_" tutorial/data/example/genes.fasta
## >GENE_1
## ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
## GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
## CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
## TTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT
## CGTCCGGGTGTGACCGAAAGGTAAG
## >GENE_2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE_3
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
## AGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGAC
## AAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCC
## AATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGAT
## AACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATA
## ATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC
## >GENE_4
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAGCAAGGTGAAATC
## AAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCA
## CTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## >GENE_5
## ATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAATAGCGTACTTCTTTTT
## CTTGCTTTCGTGGTATTCTTGC
## >GENE_6
## ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGG
## AACCTAGTAATAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCC
## AACAGGAATAGGTTTTTGTATATAATTAAGTTAATTTTCCTCTGGCTGT
## >GENE_7
## ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGG
## ACTTTTAAAGTTTCCAT
## >GENE_8
## ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTAC
## CAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATAC
## GAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAATTTGCACTGACTTGCTTTAGC
## ACTCA
## >GENE_9
## ATGATTGAACTTTCATTAATTGACTTCTATTTGTGCTTTTTAGCCTTTCTGCTATTCCTT
## GTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTG
## >GENE_10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC
Replace sequence by regular expression:
seqkit replace --by-seq --pattern "AGC" --replacement "---" tutorial/data/example/genes.fasta
## >GENE-1
## ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
## GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
## CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
## TTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGC---CGATCATC---ACATCTAGGTTT
## CGTCCGGGTGTGACCGAAAGGTAAG
## >GENE-2
## ATGGAG---CTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAAC---CCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE-3
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
## AGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGAC
## AAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCC
## AATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGAT
## AACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATA
## ATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC
## >GENE-4
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGA---AAGGTGAAATC
## AAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACA---CTCA
## CTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## >GENE-5
## ATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAAT---GTACTTCTTTTT
## CTTGCTTTCGTGGTATTCTTGC
## >GENE-6
## ATGGCAGATTCCAACGGTACTATTACCGTTGAAG---TTAAAA---TCCTTGAACAATGG
## AACCTAGTAATAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCC
## AACAGGAATAGGTTTTTGTATATAATTAAGTTAATTTTCCTCTGGCTGT
## >GENE-7
## ATGTTTCATCTCGTTGACTTTCAGGTTACTAT---AGAGATATTACTAATTATTATGAGG
## ACTTTTAAAGTTTCCAT
## >GENE-8
## ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTG---TTTATCACTAC
## CAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATAC
## GAGGGCAATTCACCATTTCATCCTCT---TGATAACAAATTTGCACTGACTTGCTTT---
## ACTCA
## >GENE-9
## ATGATTGAACTTTCATTAATTGACTTCTATTTGTGCTTTTT---CTTTCTGCTATTCCTT
## GTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTG
## >GENE-10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGT---TGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGG---TAGAAAATC
Display help information for sort subcommand:
seqkit sort --help
## sort sequences by id/name/sequence/length.
##
## By default, all records will be readed into memory.
## For FASTA format, use flag -2 (--two-pass) to reduce memory usage. FASTQ not
## supported.
##
## Firstly, seqkit reads the sequence head and length information.
## If the file is not plain FASTA file,
## seqkit will write the sequences to temporary files, and create FASTA index.
##
## Secondly, seqkit sorts sequence by head and length information
## and extracts sequences by FASTA index.
##
## Usage:
## seqkit sort [flags]
##
## Flags:
## -b, --by-bases by non-gap bases
## -l, --by-length by sequence length
## -n, --by-name by full name instead of just id
## -s, --by-seq by sequence
## -G, --gap-letters string gap letters (default "- \t.")
## -h, --help help for sort
## -i, --ignore-case ignore case
## -k, --keep-temp keep temporary FASTA and .fai file when using 2-pass mode
## -N, --natural-order sort in natural order, when sorting by IDs/full name
## -r, --reverse reverse the result
## -L, --seq-prefix-length int length of sequence prefix on which seqkit sorts by sequences (0 for whole sequence) (default 10000)
## -2, --two-pass two-pass mode read files twice to lower memory usage. (only for FASTA format)
##
## Global Flags:
## --alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
## --id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
## --id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
## --infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
## -w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
## -o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
## --quiet be quiet and do not show extra information
## -t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
## -j, --threads int number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)
Sort file by sequence name:
seqkit sort --by-name tutorial/data/example/genes.fasta
## [INFO][0m read sequences ...
## [INFO][0m 10 sequences loaded
## [INFO][0m sorting ...
## [INFO][0m output ...
## >GENE-1
## ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
## GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
## CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
## TTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT
## CGTCCGGGTGTGACCGAAAGGTAAG
## >GENE-10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC
## >GENE-2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE-3
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
## AGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGAC
## AAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCC
## AATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGAT
## AACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATA
## ATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC
## >GENE-4
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAGCAAGGTGAAATC
## AAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCA
## CTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## >GENE-5
## ATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAATAGCGTACTTCTTTTT
## CTTGCTTTCGTGGTATTCTTGC
## >GENE-6
## ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGG
## AACCTAGTAATAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCC
## AACAGGAATAGGTTTTTGTATATAATTAAGTTAATTTTCCTCTGGCTGT
## >GENE-7
## ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGG
## ACTTTTAAAGTTTCCAT
## >GENE-8
## ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTAC
## CAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATAC
## GAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAATTTGCACTGACTTGCTTTAGC
## ACTCA
## >GENE-9
## ATGATTGAACTTTCATTAATTGACTTCTATTTGTGCTTTTTAGCCTTTCTGCTATTCCTT
## GTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTG
Sort file by sequence length:
seqkit sort --by-length tutorial/data/example/genes.fasta
## [INFO][0m read sequences ...
## [INFO][0m 10 sequences loaded
## [INFO][0m sorting ...
## [INFO][0m output ...
## >GENE-7
## ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGG
## ACTTTTAAAGTTTCCAT
## >GENE-5
## ATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAATAGCGTACTTCTTTTT
## CTTGCTTTCGTGGTATTCTTGC
## >GENE-9
## ATGATTGAACTTTCATTAATTGACTTCTATTTGTGCTTTTTAGCCTTTCTGCTATTCCTT
## GTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTG
## >GENE-10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC
## >GENE-4
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAGCAAGGTGAAATC
## AAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCA
## CTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## >GENE-6
## ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGG
## AACCTAGTAATAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCC
## AACAGGAATAGGTTTTTGTATATAATTAAGTTAATTTTCCTCTGGCTGT
## >GENE-8
## ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTAC
## CAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATAC
## GAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAATTTGCACTGACTTGCTTTAGC
## ACTCA
## >GENE-2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE-1
## ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
## GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
## CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
## TTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT
## CGTCCGGGTGTGACCGAAAGGTAAG
## >GENE-3
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
## AGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGAC
## AAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCC
## AATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGAT
## AACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATA
## ATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC
Display help information for shuffle subcommand:
seqkit shuffle --help
## shuffle sequences.
##
## By default, all records will be readed into memory.
## For FASTA format, use flag -2 (--two-pass) to reduce memory usage. FASTQ not
## supported.
##
## Firstly, seqkit reads the sequence IDs. If the file is not plain FASTA file,
## seqkit will write the sequences to temporary files, and create FASTA index.
##
## Secondly, seqkit shuffles sequence IDs and extract sequences by FASTA index.
##
## Usage:
## seqkit shuffle [flags]
##
## Flags:
## -h, --help help for shuffle
## -k, --keep-temp keep temporary FASTA and .fai file when using 2-pass mode
## -s, --rand-seed int rand seed for shuffle (default 23)
## -2, --two-pass two-pass mode read files twice to lower memory usage. (only for FASTA format)
##
## Global Flags:
## --alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
## --id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
## --id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
## --infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
## -w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
## -o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
## --quiet be quiet and do not show extra information
## -t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
## -j, --threads int number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)
Shuffle sequences:
# set seed to reproduce result
seqkit shuffle --rand-seed 1701 tutorial/data/example/genes.fasta
## [INFO][0m read sequences ...
## [INFO][0m 10 sequences loaded
## [INFO][0m shuffle ...
## [INFO][0m output ...
## >GENE-5
## ATGTACTCATTCGTTTCGGAAGAGACAGGTACGTTAATAGTTAATAGCGTACTTCTTTTT
## CTTGCTTTCGTGGTATTCTTGC
## >GENE-3
## ATGTTTGTTTTTCTTGTTTTATTGCCACTAGTCTCTAGTCAGTGTGTTAATCTTACAACC
## AGAACTCAATTACCCCCTGCATACACTAATTCTTTCACACGTGGTGTTTATTACCCTGAC
## AAAGTTTTCAGATCCTCAGTTTTACATTCAACTCAGGACTTGTTCTTACCTTTCTTTTCC
## AATGTTACTTGGTTCCATGCTATACATGTCTCTGGGACCAATGGTACTAAGAGGTTTGAT
## AACCCTGTCCTACCATTTAATGATGGTGTTTATTTTGCTTCCACTGAGAAGTCTAACATA
## ATAAGAGGCTGGATTTTTGGTACTACTTTAGATTC
## >GENE-8
## ATGAAAATTATTCTTTTCTTGGCACTGATAACACTCGCTACTTGTGAGCTTTATCACTAC
## CAAGAGTGTGTTAGAGGTACAACAGTACTTTTAAAAGAACCTTGCTCTTCTGGAACATAC
## GAGGGCAATTCACCATTTCATCCTCTAGCTGATAACAAATTTGCACTGACTTGCTTTAGC
## ACTCA
## >GENE-6
## ATGGCAGATTCCAACGGTACTATTACCGTTGAAGAGCTTAAAAAGCTCCTTGAACAATGG
## AACCTAGTAATAGGTTTCCTATTCCTTACATGGATTTGTCTTCTACAATTTGCCTATGCC
## AACAGGAATAGGTTTTTGTATATAATTAAGTTAATTTTCCTCTGGCTGT
## >GENE-9
## ATGATTGAACTTTCATTAATTGACTTCTATTTGTGCTTTTTAGCCTTTCTGCTATTCCTT
## GTTTTAATTATGCTTATTATCTTTTGGTTCTCACTTGAACTG
## >GENE-1
## ATTAAAGGTTTATACCTTCCCAGGTAACAAACCAACCAACTTTCGATCTCTTGTAGATCT
## GTTCTCTAAACGAACTTTAAAATCTGTGTGGCTGTCACTCGGCTGCATGCTTAGTGCACT
## CACGCAGTATAATTAATAACTAATTACTGTCGTTGACAGGACACGAGTAACTCGTCTATC
## TTCTGCAGGCTGCTTACGGTTTCGTCCGTGTTGCAGCCGATCATCAGCACATCTAGGTTT
## CGTCCGGGTGTGACCGAAAGGTAAG
## >GENE-2
## ATGGAGAGCCTTGTCCCTGGTTTCAACGAGAAAACACACGTCCAACTCAGTTTGCCTGTT
## TTACAGGTTCGCGACGTGCTCGTACGTGGCTTTGGAGACTCCGTGGAGGAGGTCTTATCA
## GAGGCACGTCAACATCTTAAAGATGGCACTTGTGGCTTAGTAGAAGTTGAAAAAGGCGTT
## TTGCCTCAACTTGAACAGCCCTATGTGTTCATCAAACGTTCGGATGCTCGAA
## >GENE-10
## ATGAAATTTCTTGTTTTCTTAGGAATCATCACAACTGTAGCTGCATTTCACCAAGAATGT
## AGTTTACAGTCATGTACTCAACATCAACCATATGTAGTTGATGACCCGTGTCCTATTCAC
## TTCTATTCTAAATGGTATATTAGAGTAGGAGCTAGAAAATC
## >GENE-4
## ATGGATTTGTTTATGAGAATCTTCACAATTGGAACTGTAACTTTGAAGCAAGGTGAAATC
## AAGGATGCTACTCCTTCAGATTTTGTTCGCGCTACTGCAACGATACCGATACAAGCCTCA
## CTCCCTTTCGGATGGCTTATTGTTGGCGTTGCACTTCTTGCTG
## >GENE-7
## ATGTTTCATCTCGTTGACTTTCAGGTTACTATAGCAGAGATATTACTAATTATTATGAGG
## ACTTTTAAAGTTTCCAT
The exercies below are designed to strengthen your knowledge of using SeqKit for FASTA/Q parsing. The solution to each problem is blurred, only after attempting to solve the problem yourself should you look at the solution. Should you need any help, please ask one of the instructors.
Create a directory to store the output files from each exercise:
mkdir exercises
mkdir exercises/covid19
mkdir exercises/malaria
mkdir exercises/leishmania
Plasmodium falciparum is the malarial parasite most dangerous to humans, accounting for over 90% of all malarial deaths, and was the first species of the genus Plasmodium to be sequenced. Its genome is notable for an exceptionally low GC content of under 20%. In other respects, the genome has a similar size (23.3 Mb) and gene count (about 5300) to other species of malaria.
The malaria directory contains a FASTA file of all the cDNA sequences in the parasite genome. The file was downloaded from the Plasmodium falciparum portal maintained by the Ensembl genome database. Using the SeqKit software, answer the follow questions about the parasite genes:
# 5,515
seqkit stats tutorial/data/malaria/Plasmodium_falciparum.ASM276v2.cdna.all.fa > exercises/malaria/total-genes.txt
# PF3D7_0628100
seqkit sort --by-length --reverse tutorial/data/malaria/Plasmodium_falciparum.ASM276v2.cdna.all.fa | seqkit head -n 1 | seqkit seq --name > exercises/malaria/longest-gene.txt
## [INFO][0m read sequences ...
## [INFO][0m 5515 sequences loaded
## [INFO][0m sorting ...
## [INFO][0m output ...
# 14.90 to 50.98
seqkit watch --fields GC --img exercises/malaria/genes-gc.png tutorial/data/malaria/Plasmodium_falciparum.ASM276v2.cdna.all.fa
## [2J GC content
## 0.113┤ _
## 0.110┤ _ █
## 0.101┤ █ _█
## 0.093┤ █_██_
## 0.081┤ █████_
## │ ██████
## 0.072┤ _██████_
## 0.062┤ ████████__
## 0.051┤ _██████████
## 0.046┤ ███████████__
## │ █████████████
## 0.026┤ _█████████████__
## 0.022┤ ████████████████___
## 0.011┤ _███████████████████___
## │──────────────────────────────────────────────────────────────────────────
## │1111111222222222223333333333334444444444445
## │4567899012345567890012345567890012345567890
## │...........................................
## │9764319864310865320875420976421976431986531
## │Count: 5515 Mean: 25.03 Stdev: 4.18 Min: 14.90 Max: 50.98 Precision: 2 Bins: 43
# Search sequences mentioning "chloroquine" and extract the sequence by name
grep "chloroquine" tutorial/data/malaria/Plasmodium_falciparum.ASM276v2.cdna.all.fa
seqkit grep --pattern "CAD50842" tutorial/data/malaria/Plasmodium_falciparum.ASM276v2.cdna.all.fa > exercises/malaria/drug-gene.txt
## >CAD50842 cdna chromosome:ASM276v2:7:403222:406317:1 gene:PF3D7_0709000 gene_biotype:protein_coding transcript_biotype:protein_coding description:chloroquine resistance transporter
seqkit grep --pattern "CAD50842" tutorial/data/malaria/Plasmodium_falciparum.ASM276v2.cdna.all.fa | seqkit sliding --step 1 --window 10 | seqkit watch --fields GC --img exercises/malaria/gc-hist.png
## [2J GC content
## 0.041┤ ______
## 0.040┤ ______│████│
## │ │████││████│
## │ │████││████│
## │ │████││████│
## 0.029┤ │████││████│ ______
## │ │████││████│ │████│
## 0.022┤ ______ │████││████│ │████│
## │ │████│ │████││████│ │████│
## │ │████│ │████││████│ │████│
## 0.014┤ │████│ │████││████│ │████│______
## │ │████│ │████││████│ │████││████│
## 0.006┤______│████│ │████││████│ │████││████│
## 0.006┤│████││████│ │████││████│ │████││████│ ______
## │──────────────────────────────────────────────────────────────────────────
## │0 6 1 1 2 3 3 4 5 5 6
## │. . 2 9 5 1 8 4 0 7 3
## │0 3 . . . . . . . . .
## │0 6 7 1 5 8 2 5 9 3 6
## │Count: 1266 Mean: 28.4 Stdev: 14.3 Min: 0.0 Max: 70.0 Precision: 1 Bins: 11
Leishmania is a Tryanosomatid protozoa and is the parasite responsible for the disease Leishmaniasis. Leishmania is transmitted by the bite of certain species of sand fly and affects the populations of 88 tropical and sub-tropical countries worldwide. The symptoms are cutaneous and muco-cutaneous lesions initially around the bite, the parasite can also migrate causing visceral leishmaniasis affecting the haemopoietic organs.
The leishmania directory contains the Leishmania major reference genome and gene sequences in FASTA format. The file was downloaded from the Plasmodium falciparum portal maintained by the Ensembl genome database. Using the SeqKit software, answer the follow questions about the parasite genome:
# 2,706
seqkit translate tutorial/data/leishmania/Leishmania_major.ASM272v2.cdna.all.fa | grep -c "[SCNBQZTY]" > exercises/leishmania/total-polar.txt
# 2 CBZ11846, CBZ11847
# 2 CBZ11883, CBZ11882
seqkit rmdup --by-seq --ignore-case --dup-num-file exercises/leishmania/dupnum.txt --out-file exercises/leishmania/dupfile.txt tutorial/data/leishmania/Leishmania_major.ASM272v2.cdna.all.fa
cat exercises/leishmania/dupnum.txt
## [INFO][0m 2 duplicated records removed
## 2 CBZ11846, CBZ11847
## 2 CBZ11883, CBZ11882
# CBZ11909
seqkit sort --by-length --reverse tutorial/data/leishmania/Leishmania_major.ASM272v2.cdna.all.fa | seqkit range --range 28:28 | seqkit seq --name > exercises/leishmania/name-28-long.txt
## [INFO][0m read sequences ...
## [INFO][0m 202 sequences loaded
## [INFO][0m sorting ...
## [INFO][0m output ...
# 38,139
seqkit grep --pattern "1" tutorial/data/leishmania/Leishmania_major.ASM272v2.dna.toplevel.fa | seqkit locate --pattern "NGRRT|NGRRN" --degenerate | tail -n +2 | wc -l > exercises/leishmania/cas9-sites.txt
# 63.1
seqkit grep --pattern "1" tutorial/data/leishmania/Leishmania_major.ASM272v2.dna.toplevel.fa | seqkit locate --bed --pattern "NGRRT|NGRRN" --degenerate > exercises/leishmania/cas9-sites.bed
seqkit subseq --bed exercises/leishmania/cas9-sites.bed --only-flank --up-stream 20 tutorial/data/leishmania/Leishmania_major.ASM272v2.dna.toplevel.fa | seqkit watch --fields GC --img exercises/leishmania/cas9-sites-gc.png
## [INFO][0m read BED file ...
## [INFO][0m 38139 BED features loaded
## [INFO][0m create FASTA index for tutorial/data/leishmania/Leishmania_major.ASM272v2.dna.toplevel.fa
## [2J GC content
## 0.074┤ _
## │ █
## 0.065┤ █ _
## │ █ █
## │ █ █
## 0.049┤ _ █ █
## 0.044┤ _█ █ █ _
## │ ██ █ █ █
## 0.034┤ _ ██ █ █ █
## │ █ ██ █ █ █
## 0.024┤ █ ██ █ █ █ _
## 0.017┤ _ █ ██_█ █ █ █
## │ █ █ ████ █ █ █
## 0.008┤ _ █ █ _████ █ █ █
## │──────────────────────────────────────────────────────────────────────────
## │0257111122223333444455556666777788889999
## │....025702570257025702570257025702570257
## │0505....................................
## │0000050505050505050505050505050505050505
## │Count: 38139 Mean: 63.1 Stdev: 11.3 Min: 0.0 Max: 100.0 Precision: 1 Bins: 40